1 Structure

1.1 Column types

# 
# numeric 
#      24

1.2 Missing values, NAs, and Negatives

# [1] "Are all rows complete?: TRUE"
# [1] "Are there any NAs?: FALSE"
# [1] "Are any values negative?: FALSE"
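
The three checks reported above can be sketched as follows. This is a minimal illustration, not the report's own code: the function name `check_table` and the list-of-rows input format are assumptions made for the example.

```python
import math


def check_table(rows):
    """Run the three structure checks on a list of numeric rows.

    `rows` is a hypothetical input format: a list of lists, where a
    missing value is represented by None or by float('nan').
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    values = [v for row in rows for v in row]
    any_missing = any(is_missing(v) for v in values)
    # A row is complete when none of its values is missing.
    all_complete = all(not is_missing(v) for row in rows for v in row)
    any_negative = any(v < 0 for v in values if not is_missing(v))
    return {
        "all_rows_complete": all_complete,
        "any_na": any_missing,
        "any_negative": any_negative,
    }


report = check_table([[1.0, 2.0], [0.0, 3.5]])
print(report)
```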

2 Level 0

2.1 Heatmap

The heatmap below represents the data, with each value colored according to its magnitude. Hover over the plot to see column names.


2.2 First Moment Statistics

The column means and medians are presented in a combined heatmap and line plot below.
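
As a rough sketch of the statistics being plotted, the per-column mean and median can be computed as below. The function name `column_summaries` and the list-of-rows input are assumptions for illustration only.

```python
from statistics import mean, median


def column_summaries(rows):
    """Return (means, medians), one entry per column of `rows`."""
    columns = list(zip(*rows))  # transpose rows into columns
    means = [mean(col) for col in columns]
    medians = [median(col) for col in columns]
    return means, medians


# Hypothetical data: three observations of two features.
means, medians = column_summaries([[1, 10], [2, 20], [6, 90]])
print(means, medians)
```

Comparing the two per column gives a quick hint of skew: in the hypothetical second column, one large value pulls the mean well above the median.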

3 Second Moment Statistics

3.1 Correlation Matrix (heatmap)

The correlation between two random variables measures a specific type of dependence: the degree to which a linear relationship exists between them. A value of 1 corresponds to a perfect increasing linear relationship, 0 corresponds to no linear relationship, and -1 corresponds to a perfect decreasing linear relationship.

  1. Correlation
  2. Correlation and dependence
  3. Example graphic
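
To make the three reference values concrete, here is a minimal sketch of the Pearson correlation coefficient for two equal-length samples. It is an illustration with made-up inputs, not the code used to produce the report.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


print(pearson([1, 2, 3], [2, 4, 6]))         # increasing linear relationship
print(pearson([1, 2, 3], [6, 4, 2]))         # decreasing linear relationship
print(pearson([1, 2, 3, 4], [1, -1, -1, 1])) # dependent, but not linearly
```

The last pair shows why correlation captures only linear dependence: the second sequence is fully determined by the first, yet the coefficient is zero.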

4 Density Estimates

4.1 Histogram (1D Heatmaps)

For each feature column, the data are binned and a heatmap is produced with each bin colored according to count.
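
The binning step can be sketched as below, assuming equal-width bins spanning the observed range of the column. The bin count and the function name `bin_counts` are assumptions; the report does not state how its bins are chosen.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of `n_bins` equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than overflowing.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical feature column.
print(bin_counts([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5))
```

Each count then maps to one colored cell, so a column's histogram becomes a single strip of the heatmap.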

4.2 Marginals (2D Heatmaps)

A pairs plot is a popular way of visualizing high-dimensional data.
Each pair of dimensions is plotted, showing the projection of the data onto those two dimensions.

For readability, a maximum of 8 dimensions is plotted.

The full plot can be downloaded here

5 Outlier Plots

An outlier is a data point that lies relatively far from the bulk of the other observations. Outliers can have unwanted effects on data analysis and should therefore be considered carefully.

We use the built-in method from the randomForest package in R.

  1. randomForest
  2. Outlier
  3. randomForest_outlier

6 Cluster Analysis

6.1 BIC Plots

The Bayesian Information Criterion is used to select the model parameters for Mclust.

  1. BIC
  2. List of mclustModelNames p.88
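
As a sketch of how the criterion ranks candidate models, the snippet below uses the sign convention in which a larger BIC is better, BIC = 2 log L - k log n, where L is the maximized likelihood, k the number of free parameters, and n the number of observations. The log-likelihoods and parameter counts are invented for illustration; only the model names EII and VVV come from the mclust model list.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'larger is better' convention: 2*logL - k*log(n)."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)


def select_model(candidates, n_obs):
    """Return the name of the candidate with the highest BIC.

    `candidates` maps a model name to (log_likelihood, n_params).
    """
    return max(candidates, key=lambda name: bic(*candidates[name], n_obs))


# Hypothetical fits: VVV fits better but pays for 20 parameters.
fits = {"EII": (-120.0, 4), "VVV": (-100.0, 20)}
print(select_model(fits, n_obs=100))
```

The penalty term grows with both the parameter count and the sample size, which is why the more flexible model loses here despite its higher likelihood.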

6.2 Binary Hierarchical Mclust classifications

# [1] "Fraction of points in each cluster:"
# 
#      1      2      3      4      5      6      7      8      9     10 
# 0.0209 0.0310 0.0078 0.0116 0.0213 0.0181 0.0384 0.0116 0.0150 0.0056 
#     11     12     13     14     15     16     17     18     19     20 
# 0.0043 0.0036 0.0170 0.0090 0.0169 0.0082 0.0676 0.0422 0.0429 0.0549 
#     21     22     23     24     25     26     27     28     29     30 
# 0.0478 0.0212 0.0461 0.0565 0.0433 0.0817 0.0560 0.0542 0.0663 0.0205 
#     31     32 
# 0.0375 0.0210

The full plot can be downloaded here

6.3 HMClust output: Cluster means

6.4 Cluster Dendrogram

This is the dendrogram from hierarchical Mclust. The default maximum tree depth is set to 6.

# 111111 111112 111121 111122 111211 111212 111221 111222 112111 112112 
# 0.0209 0.0310 0.0078 0.0116 0.0213 0.0181 0.0384 0.0116 0.0150 0.0056 
# 112121 112122 112211 112212 112221 112222 121111 121112 121121 121122 
# 0.0043 0.0036 0.0170 0.0090 0.0169 0.0082 0.0676 0.0422 0.0429 0.0549 
# 121211 121212 121221 121222 122111 122112 122121 122122 122211 122212 
# 0.0478 0.0212 0.0461 0.0565 0.0433 0.0817 0.0560 0.0542 0.0663 0.0205 
# 122221 122222 
# 0.0375 0.0210
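
The labels in the output above appear to encode the path from the root through the binary tree, with a leading 1 for the root and then 1 or 2 for each successive split. Under that reading, which is an inference from the output and not something the report states, the six-digit labels map onto the cluster numbers 1 to 32 of section 6.2 as sketched here. The function name `leaf_index` is hypothetical.

```python
def leaf_index(label):
    """Map a path label such as '112111' to a 1-based leaf number.

    Assumes the first digit is the root and each later digit is
    1 (first child) or 2 (second child).
    """
    index = 0
    for digit in label[1:]:  # skip the root digit
        index = index * 2 + (int(digit) - 1)
    return index + 1


for label in ["111111", "111112", "112111", "121111", "122222"]:
    print(label, leaf_index(label))
```

With this mapping the fractions listed in both outputs line up in the same order, for example label 121111 and cluster 17 both show 0.0676.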

6.5 HMClust output: Cluster correlation matrices

7 Spectral Analysis

7.1 Cumulative Sum of Variance

The variance measures how spread out the data are from their mean. The cumulative variance gives, as a percentage, the share of the total variation accounted for by the dimensions included so far.

In this implementation we use principal components analysis to select linear combinations of the features that explain the dataset best in low dimensions.

The plot below shows how much variance is explained when adding columns one at a time. The elbows denote good “cut-off” points for dimension selection.

  1. Variance
  2. PCA
  3. Elbows
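
The cumulative percentage can be sketched as below, starting from the variance attributed to each principal component. The input values are hypothetical, and the function name `cumulative_variance_percent` is an assumption for this example; computing the components themselves is left out.

```python
def cumulative_variance_percent(component_variances):
    """Cumulative share of total variance, in percent, largest component first."""
    total = sum(component_variances)
    running = 0.0
    shares = []
    for variance in sorted(component_variances, reverse=True):
        running += variance
        shares.append(100.0 * running / total)
    return shares


# Hypothetical component variances.
print(cumulative_variance_percent([4.0, 3.0, 2.0, 1.0]))
```

An elbow is a point on this curve after which each additional component adds noticeably less than the ones before it.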

7.2 Right Singular Vectors Heatmap

The right singular vectors of the data matrix are plotted below in a heatmap.

7.3 Right Singular Vectors Pairs plot

The right singular vectors are plotted below in a pairs plot. A maximum of 8 pairs will be plotted for readability.

The full RSV plots can be downloaded here and here.

7.4 3D PCA of correlation matrix